57 results found.
Written
Corpus,
Language Type:
Multilingual
Languages:
Arabic Bulgarian Catalan Croatian Czech Danish Dutch English Estonian Filipino Finnish French German Greek Hebrew Hindi Hungarian Indonesian Italian Japanese Korean Latvian Lithuanian Malay Norwegian Persian Polish Portuguese Romanian Russian Serbian Simplified Chinese Slovak Slovenian Spanish Swedish Thai Traditional Chinese Turkish Ukrainian Vietnamese
Availability:
Freely Available
License:
CC-BY-SA
Size:
60 GByte
Production Status:
Newly created-in progress
Use:
Language Modelling
- Paper title: Wiki-40B: Multilingual Language Model Dataset
- Paper track: Written/oral presentation
- Paper status: Accept Oral
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Rami Al-Rfou | Wiki40B-LM | /N |
Documentation:
None
Written
Corpus,
Language Type:
Monolingual
Languages:
Afrikaans Albanian Arabic Armenian Bangla Basque Bosnian Breton Bulgarian Catalan Croatian Czech Danish Dutch English Esperanto Estonian Filipino Finnish French Galician Georgian German Greek Hebrew Hindi Hungarian Icelandic Indonesian Italian Japanese Kazakh Korean Latvian Lithuanian Macedonian Malay Malayalam Norwegian Persian Polish Portuguese Romanian Russian Serbian Sinhala Slovak Slovenian Spanish Swedish Tamil Telugu Thai Turkish Ukrainian Urdu Vietnamese pt_br ze_en ze_zh zh_cn zh_tw
Availability:
Freely Available
License:
<Not Specified>
Size:
22.10G tokens
Production Status:
Existing-used
Use:
Machine Translation, SpeechToSpeech Translation
- Paper title: word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs
- Paper track: Written/oral presentation
- Paper status: Accept Poster
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Yo Joong Choe | OpenSubtitles2018 | /N |
Documentation:
Yes, on the website.
Written
Lexicon,
Language Type:
Monolingual
Languages:
Afrikaans Albanian Arabic Armenian Bangla Basque Bosnian Breton Bulgarian Catalan Croatian Czech Danish Dutch English Esperanto Estonian Filipino Finnish French Galician Georgian German Greek Hebrew Hindi Hungarian Icelandic Indonesian Italian Japanese Kazakh Korean Latvian Lithuanian Macedonian Malay Malayalam Norwegian Persian Polish Portuguese Romanian Russian Serbian Sinhala Slovak Slovenian Spanish Swedish Tamil Telugu Thai Turkish Ukrainian Urdu Vietnamese pt_br ze_en ze_zh zh_cn zh_tw
Availability:
Freely Available
License:
CreativeCommons Attribution 4.0 International
Size:
41 GByte
Production Status:
Newly created-finished
Use:
Machine Translation, SpeechToSpeech Translation
- Paper title: word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs
- Paper track: Written/oral presentation
- Paper status: Accept Poster
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Yo Joong Choe | word2word | /N |
Documentation:
Yes, on the website.
Multimodal/Multimedia
Corpus,
Language Type:
Monolingual
Languages:
Adyghe Albanian Ancient Greek Arabic Armenian Asturian Basque Belarusian Bulgarian Catalan Church Slavic Classic Syriac Classical Armenian Czech Danish Dutch English Estonian Faroese Finnish Georgian German Gothic Hindi Hungarian Icelandic Ingrian Irish Kabardian Kalaallisut Kannada Kazakh Khakas Latin Latvian Lithuanian Livonian languages Low German Lower Sorbian Macedonian Maltese Middle French Middle High German Middle Low German Modern Greek Neapolitan Northern Sami Occitan Old English Old French Old Irish Old Saxon Pashto Persian Polish Portuguese Romanian Slovenian Spanish Swedish Tibetan Turkish Turkmen Ukrainian Urdu Veps Votic Welsh
Availability:
Freely Available
License:
Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Size:
557.3 MByte
Production Status:
Newly created-in progress
Use:
Morphological Analysis
- Paper title: Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus
- Paper track: Multimodality/oral presentation
- Paper status: Accept Poster
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Eleni Metheniti | Wikinflection Corpus | /N |
Documentation:
https://github.com/lenakmeth/Wikinflection-Corpus/blob/master/README.md
Written
Language Modeling Tool,
Language Type:
Monolingual
Languages:
Romanian
Availability:
Freely Available
License:
Apache 2
Size:
3.5 GByte
Production Status:
Newly created-finished
Use:
Language Modelling
- Paper title: RoBERT – A Romanian BERT Model
- Paper track: Long paper/
- Paper status: Accept Poster
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Stefan Ruseti | RoBERT-large | /N |
Documentation:
None
Written
Language Modeling Tool,
Language Type:
Monolingual
Languages:
Romanian
Availability:
Freely Available
License:
Apache 2
Size:
1.2 GByte
Production Status:
Newly created-finished
Use:
Language Modelling
- Paper title: RoBERT – A Romanian BERT Model
- Paper track: Long paper/
- Paper status: Accept Poster
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Stefan Ruseti | RoBERT-base | /N |
Documentation:
None
Written
Language Modeling Tool,
Language Type:
Monolingual
Languages:
Romanian
Availability:
Freely Available
License:
Apache 2
Size:
205 MByte
Production Status:
Newly created-finished
Use:
Language Modelling
- Paper title: RoBERT – A Romanian BERT Model
- Paper track: Long paper/
- Paper status: Accept Poster
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Stefan Ruseti | RoBERT-small | /N |
Documentation:
None
Written
Corpus,
Language Type:
Monolingual
Languages:
Bulgarian Croatian Czech Danish Dutch English Estonian Finnish French German Greek Hungarian Icelandic Irish Italian Latvian Lithuanian Maltese Polish Portuguese Romanian Slovak Slovenian Spanish Swedish
Availability:
Freely Available
License:
CC-0
Size:
341856530 sentences
Production Status:
Newly created-in progress
Use:
Machine Translation, SpeechToSpeech Translation
- Paper title: ParaCrawl: Web-Scale Acquisition of Parallel Corpora
- Paper track: Long/Resources and Evaluation
- Paper status: Accept
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Main Contact | Philipp Koehn | ParaCrawl | /N |
Documentation:
None
Speech
Corpus,
Language Type:
Multilingual
Languages:
Romanian
Availability:
Freely Available
License:
OpenSource
Size:
2500 sentences
Production Status:
Newly created-in progress
Use:
Speech Recognition/Understanding
- Paper title: Crowd-sourced, automatic speech-corpora collection – building the Romanian Anonymous Speech Corpus
- Paper track: <Not Specified>
- Paper status: Accept
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Author 1 | Stefan Daniel Dumitrescu | Research Institute for Artificial Intelligence | RO |
| Author 2 | Tiberiu Boroș | Research Institute for Artificial Intelligence, Romanian Academy | RO |
| Author 3 | Radu Ion | Research Institute for Artificial Intelligence, Romanian Academy | RO |
| Main Contact | Tiberiu Boroș | Adobe Systems Romania | None |
Documentation:
<Not Specified>
Language Type:
Multilingual
Languages:
Romanian
Availability:
From Owner
License:
<Not Specified>
Size:
<Not Specified>
Production Status:
Existing-used
Use:
any natural language processing task on Romanian
- Paper title: The Romanian Neuter Examined Through A Two-Gender N-Gram Classification System
- Paper track: General issues
- Paper status: Accept Poster
| Author Number | Name | Affiliation | Country |
|---|---|---|---|
| Author 1 | Liviu P. Dinu | University of Bucharest | None |
| Author 2 | Vlad Niculae | University of Bucharest | None |
| Author 3 | Octavia-Maria Şulea | University of Bucharest | None |
| Main Contact | Liviu P. Dinu | Faculty of Mathematics and Computer Science, University of Bucharest | RO |
Documentation:
<Not Specified>




